nlp_architect.utils package

Submodules

nlp_architect.utils.ansi2html module

nlp_architect.utils.ansi2html.ansi2html(text, palette='solarized')[source]
nlp_architect.utils.ansi2html.run(file, out)[source]

nlp_architect.utils.embedding module

class nlp_architect.utils.embedding.ELMoEmbedderTFHUB[source]

Bases: object

get_vector(tokens)[source]
class nlp_architect.utils.embedding.FasttextEmbeddingsModel(size: int = 5, window: int = 3, min_count: int = 1, skipgram: bool = True)[source]

Bases: object

Fasttext embedding trainer class

Parameters:
  • texts (List[List[str]]) – list of tokenized sentences
  • size (int) – embedding size
  • epochs (int, optional) – number of epochs to train
  • window (int, optional) – the maximum distance between the current and predicted word within a sentence
classmethod load(path)[source]

Load model from path

save(path) → None[source]

Save model to path

train(texts: List[List[str]], epochs: int = 100)[source]
vec(word: str) → numpy.ndarray[source]

Return the vector corresponding to a given word
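
A minimal usage sketch of the trainer above (the toy corpus and file path are illustrative, not from the library's docs):

    from nlp_architect.utils.embedding import FasttextEmbeddingsModel

    sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]  # toy corpus
    model = FasttextEmbeddingsModel(size=5, window=3, min_count=1, skipgram=True)
    model.train(sentences, epochs=10)   # train on tokenized sentences
    vector = model.vec("cat")           # numpy.ndarray of length `size`
    model.save("fasttext.model")        # persist, then restore with the classmethod
    restored = FasttextEmbeddingsModel.load("fasttext.model")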

nlp_architect.utils.embedding.fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size)[source]

Creates a new embedding matrix from a given matrix of word ints, using the provided embedding model.

Parameters:
  • src_mat (numpy.ndarray) – source matrix
  • src_lex (dict) – source matrix lexicon
  • emb_lex (dict) – embedding lexicon
  • emb_size (int) – embedding vector size
nlp_architect.utils.embedding.get_embedding_matrix(embeddings: dict, vocab: nlp_architect.utils.text.Vocabulary) → numpy.ndarray[source]

Generate a matrix of word embeddings given a vocabulary

Parameters:
  • embeddings (dict) – a dictionary of embedding vectors
  • vocab (Vocabulary) – a Vocabulary
Returns:

a 2D numpy matrix of lexicon embeddings

nlp_architect.utils.embedding.load_embedding_file(filename: str) → dict[source]

Load a word embedding file

Parameters:filename (str) – path to embedding file
Returns:dictionary with embedding vectors
Return type:dict
nlp_architect.utils.embedding.load_word_embeddings(file_path, vocab=None)[source]

Loads a word embedding model text file into a word (str) to numpy vector dictionary

Parameters:
  • file_path (str) – path to model file
  • vocab (list of str) – optional - vocabulary
Returns:

a word (str) to numpy.ndarray vector dictionary, and an int with the detected word embedding vector size

Return type:

tuple
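
The functions above compose into a typical lookup-matrix pipeline. A hedged sketch (the embedding file path is illustrative; the two return values of load_word_embeddings follow the docstring above):

    from nlp_architect.utils.embedding import (
        get_embedding_matrix,
        load_word_embeddings,
    )
    from nlp_architect.utils.text import Vocabulary

    embeddings, emb_size = load_word_embeddings("glove.6B.100d.txt")
    vocab = Vocabulary()
    for word in ["hello", "world"]:
        vocab.add(word)
    matrix = get_embedding_matrix(embeddings, vocab)  # 2D numpy.ndarray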

nlp_architect.utils.ensembler module

nlp_architect.utils.ensembler.simple_ensembler(np_arrays, weights)[source]

A simple ensembler that takes a list of n-by-m numpy array predictions and a list of weights, where n is the number of elements and m is the number of classes.

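A hedged sketch of the ensembler (it is assumed to return a weighted combination of the prediction arrays; the exact weighting scheme follows the implementation):

    import numpy as np

    from nlp_architect.utils.ensembler import simple_ensembler

    preds_a = np.array([[0.7, 0.3], [0.2, 0.8]])  # n=2 elements, m=2 classes
    preds_b = np.array([[0.6, 0.4], [0.5, 0.5]])
    combined = simple_ensembler([preds_a, preds_b], [0.6, 0.4])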

nlp_architect.utils.file_cache module

Utilities for working with the local dataset cache.

nlp_architect.utils.file_cache.cached_path(url_or_filename: Union[str, pathlib.Path], cache_dir: str = None) → str[source]

Given something that might be a URL (or might be a local path), determine which. If it’s a URL, download the file and cache it, and return the path to the cached file. If it’s already a local path, make sure the file exists and then return the path.
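
A short sketch of both input kinds (the URL and path are illustrative):

    from nlp_architect.utils.file_cache import cached_path

    # remote input: downloaded once, cached, and the local cache path returned
    local_path = cached_path("https://example.com/models/model.bin")
    # local input: existence is verified and the same path returned
    local_path = cached_path("/tmp/model.bin")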

nlp_architect.utils.file_cache.filename_to_url(filename: str, cache_dir: str = None) → Tuple[str, str][source]

Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.

nlp_architect.utils.file_cache.get_from_cache(url: str, cache_dir: str = None) → str[source]

Given a URL, look for the corresponding dataset in the local cache. If it’s not there, download it. Then return the path to the cached file.

nlp_architect.utils.file_cache.http_get(url: str, temp_file: IO) → None[source]
nlp_architect.utils.file_cache.url_to_filename(url: str, etag: str = None) → str[source]

Convert url into a hashed filename in a repeatable way. If etag is specified, append its hash to the url’s, delimited by a period.

nlp_architect.utils.generic module

nlp_architect.utils.generic.add_offset(mat: numpy.ndarray, offset: int = 1) → numpy.ndarray[source]

Add offset (default 1) to all values in matrix mat

Parameters:
  • mat (numpy.ndarray) – A 2D matrix with int values
  • offset (int) – offset to add
Returns:

the input matrix with the offset added

Return type:

numpy.ndarray

nlp_architect.utils.generic.balance(df)[source]
nlp_architect.utils.generic.license_prompt(model_name, model_website, dataset_dir=None)[source]
nlp_architect.utils.generic.normalize(txt, vocab=None, replace_char=' ', max_length=300, pad_out=True, to_lower=True, reverse=False, truncate_left=False, encoding=None)[source]
nlp_architect.utils.generic.one_hot(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]

Convert a 1D matrix of ints into one-hot encoded vectors.

Parameters:
  • mat (numpy.ndarray) – A 1D matrix of labels (int)
  • num_classes (int) – Number of all possible classes
Returns:

A 2D matrix

Return type:

numpy.ndarray
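
For example (a sketch based on the signature above):

    import numpy as np

    from nlp_architect.utils.generic import one_hot

    labels = np.array([0, 2, 1])
    encoded = one_hot(labels, num_classes=3)  # 2D matrix of shape (3, 3)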

nlp_architect.utils.generic.one_hot_sentence(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]

Convert a 2D matrix of ints into one-hot encoded 3D matrix

Parameters:
  • mat (numpy.ndarray) – A 2D matrix of labels (int)
  • num_classes (int) – Number of all possible classes
Returns:

A 3D matrix

Return type:

numpy.ndarray

nlp_architect.utils.generic.pad_sentences(sequences: numpy.ndarray, max_length: int = None, padding_value: int = 0, padding_style='post') → numpy.ndarray[source]

Pad input sequences up to max_length; values are aligned to the right.

Parameters:
  • sequences (iter) – a 2D matrix (np.array) to pad
  • max_length (int, optional) – max length of resulting sequences
  • padding_value (int, optional) – padding value
  • padding_style (str, optional) – add padding values as a prefix (‘pre’) or as a suffix (‘post’)
Returns:

input sequences padded to size ‘max_length’
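
A short sketch (list-of-lists input and ‘post’ padding semantics are assumed from the docstring):

    from nlp_architect.utils.generic import pad_sentences

    seqs = [[1, 2, 3], [4, 5]]
    padded = pad_sentences(seqs, max_length=4, padding_value=0,
                           padding_style='post')
    # each row is now 4 values long, padded with 0 at the end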

nlp_architect.utils.generic.to_one_hot(txt, vocab={'!': 40, '#': 49, '$': 50, '%': 51, '&': 53, '(': 61, ')': 62, '*': 54, '+': 57, ', ': 37, '-': 36, '.': 39, '/': 44, '0': 26, '1': 27, '2': 28, '3': 29, '4': 30, '5': 31, '6': 32, '7': 33, '8': 34, '9': 35, ':': 42, ';': 38, '<': 59, '=': 58, '>': 60, '?': 41, '@': 48, '[': 63, '\\': 45, ']': 64, '_': 47, 'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '{': 65, '|': 46, '}': 66, 'ˆ': 52, '˜': 55, '‘': 56, '’': 43})[source]

nlp_architect.utils.io module

nlp_architect.utils.io.check(validator)[source]
nlp_architect.utils.io.check_directory_and_create(dir_path)[source]

Check if given directory exists, create if not.

Parameters:dir_path (str) – path to directory
nlp_architect.utils.io.check_size(min_size=None, max_size=None)[source]
nlp_architect.utils.io.create_folder(path)[source]
nlp_architect.utils.io.download_unlicensed_file(url, sourcefile, destfile, totalsz=None)[source]

Download the file specified by the given URL.

Parameters:
  • url (str) – url to download from
  • sourcefile (str) – file to download from url
  • destfile (str) – save path
  • totalsz (int, optional) – total size of file
nlp_architect.utils.io.download_unzip(url: str, sourcefile: str, unzipped_path: str, license_msg: str = None)[source]

Downloads a zip file, extracts it to destination, deletes the zip file. If license_msg is supplied, user is prompted for download confirmation.

nlp_architect.utils.io.gzip_str(g_str)[source]

Transform a string to GZIP-compressed bytes

Parameters:g_str (str) – string of data
Returns:GZIP bytes data
nlp_architect.utils.io.json_dumper(obj)[source]

For objects that have members that can't be serialized and that implement a toJson() method

nlp_architect.utils.io.line_count(file)[source]

Utility function for getting number of lines in a text file.

nlp_architect.utils.io.load_files_from_path(dir_path, extension='txt')[source]

Load all files from a given directory (with the given extension)

nlp_architect.utils.io.load_json_file(file_path)[source]

Load a file into a JSON object

nlp_architect.utils.io.prepare_output_path(output_dir: str, overwrite_output_dir: str)[source]

Create the output directory, or raise an error if it exists and overwrite_output_dir is false.

nlp_architect.utils.io.sanitize_path(path)[source]
nlp_architect.utils.io.uncompress_file(filepath: str, outpath='.')[source]

Unzip a file to the same location as filepath; the decompression algorithm is selected by file extension.

Parameters:
  • filepath (str) – path to file
  • outpath (str) – path to extract to
nlp_architect.utils.io.valid_path_append(path, *args)[source]

Helper to validate passed path directory and append any subsequent filename arguments.

Parameters:
  • path (str) – Initial filesystem path. Should expand to a valid directory.
  • *args (list, optional) – Any filename or path suffixes to append to path for returning.
Returns:

(list, str) – path-prepended list of files from args, or path alone if no args specified.

Raises:

ValueError – if path is not a valid directory on this filesystem.

nlp_architect.utils.io.validate(*args)[source]

Validate that all arguments are of the correct type and in the correct range.

Parameters:
  • *args (tuple of tuples) – each tuple represents one argument validation: with range check, (arg, class, min_val, max_val); without range check, (arg, class). If class is a tuple of type objects, arg may be an instance of any of the types. To allow a None-valued argument, include type None. To disable the lower or upper bound check, set min_val or max_val to None, respectively. If arg has the len attribute (such as a string), the check is performed on its length.
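
For example, a sketch of both tuple forms described above:

    from nlp_architect.utils.io import validate

    batch_size = 32
    name = "model"
    validate(
        (batch_size, int, 1, None),  # type check with a lower bound only
        (name, str),                 # type check without a range check
    )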

nlp_architect.utils.io.validate_boolean(arg)[source]

Validates an input argument of type boolean

nlp_architect.utils.io.validate_existing_directory(arg)[source]

Validates an input argument is a path string to an existing directory.

nlp_architect.utils.io.validate_existing_filepath(arg)[source]

Validates an input argument is a path string to an existing file.
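
These validators are typically used as argparse type callbacks; a plausible sketch (an assumption, not shown in the docs above; the flag name is illustrative):

    import argparse

    from nlp_architect.utils.io import validate_existing_filepath

    parser = argparse.ArgumentParser()
    # the value is validated when the argument is parsed
    parser.add_argument("--data_file", type=validate_existing_filepath)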

nlp_architect.utils.io.validate_existing_path(arg)[source]

Validates an input argument is a path string to an existing file or directory.

nlp_architect.utils.io.validate_parent_exists(arg)[source]

Validates an input argument is a path string, and its parent directory exists.

nlp_architect.utils.io.validate_proxy_path(arg)[source]

Validates an input argument is a valid proxy path or None

nlp_architect.utils.io.walk_directory(directory, verbose=False)[source]

Iterates a directory’s text files and their contents.

nlp_architect.utils.io.zipfile_list(filepath: str)[source]

List the files inside a given zip file

Parameters:filepath (str) – path to file
Returns:String list of filenames

nlp_architect.utils.metrics module

nlp_architect.utils.metrics.acc_and_f1(preds, labels)[source]

Return accuracy and F1 score

nlp_architect.utils.metrics.accuracy(preds, labels)[source]

Return simple accuracy in the expected dict format
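
A hedged sketch of the metric helpers (the exact dict keys are an assumption, not verified against the source):

    import numpy as np

    from nlp_architect.utils.metrics import acc_and_f1, accuracy

    preds = np.array([1, 0, 1, 1])
    labels = np.array([1, 1, 1, 0])
    print(accuracy(preds, labels))    # accuracy in dict format, e.g. {'acc': 0.5}
    print(acc_and_f1(preds, labels))  # accuracy and F1 score together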

nlp_architect.utils.metrics.get_conll_scores(predictions, y, y_lex, unk='O')[source]

Get CoNLL-style scores (precision, recall, F1)

nlp_architect.utils.metrics.pearson_and_spearman(preds, labels)[source]

Get Pearson and Spearman correlations

nlp_architect.utils.metrics.simple_accuracy(preds, labels)[source]

Return simple accuracy

nlp_architect.utils.metrics.tagging(preds, labels)[source]

nlp_architect.utils.mrc_utils module

nlp_architect.utils.mrc_utils.create_data_dict(data)[source]

Function to convert data to dictionary format

Parameters:
  • data – train/dev data as a list
Returns:

a dictionary containing dev/train data

nlp_architect.utils.mrc_utils.create_squad_training(paras_file, ques_file, answer_file, data_train_len=None)[source]

Function to read data from preprocessed files and return data in the form of a list

Parameters:
  • paras_file – file name for preprocessed paragraphs
  • ques_file – file name for preprocessed questions
  • answer_file – file name for preprocessed answer spans
  • vocab_file – file name for preprocessed vocab
  • data_train_len – length of train dataset to use
Returns:

appended list for train/dev dataset

nlp_architect.utils.mrc_utils.get_data_array_squad(params_dict, data_train, set_val='train')[source]

Function to pad all sentences and restrict to max length defined by user

Parameters:
  • params_dict – dictionary containing all input parameters
  • data_train – list containing the training/dev data
  • set_val – indicates if it's a training set or dev set
Returns:

a list of tuples with padded sentences and masks

nlp_architect.utils.mrc_utils.get_qids(args, q_id_path, data_dev)[source]

Function to create a list of question_ids in dev set

Parameters:
  • q_id_path – path to question_ids file
  • data_dev – development set
Returns:

list of question ids

nlp_architect.utils.mrc_utils.max_values_squad(data_train)[source]

Function to compute the maximum length of sentences in paragraphs and questions

Parameters:
  • data_train – list containing the entire dataset
Returns:

maximum length of question and paragraph

nlp_architect.utils.string_utils module

class nlp_architect.utils.string_utils.StringUtils[source]

Bases: object

determiners = []
static find_head_lemma_pos_ner(x: str)[source]

Parameters:x – mention
Returns:the head word and the head word lemma of the mention
static is_determiner(in_str: str) → bool[source]
static is_preposition(in_str: str) → bool[source]
static is_pronoun(in_str: str) → bool[source]
static is_stop(token: str) → bool[source]
static normalize_str(in_str: str) → str[source]
static normalize_string_list(str_list: str) → List[str][source]
preposition = []
pronouns = []
spacy_no_parser = <nlp_architect.utils.text.SpacyInstance object>
spacy_parser = <nlp_architect.utils.text.SpacyInstance object>
stop_words = []

nlp_architect.utils.testing module

class nlp_architect.utils.testing.NLPArchitectTestCase(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

tearDown()[source]

Hook method for deconstructing the test fixture after testing it.

nlp_architect.utils.text module

class nlp_architect.utils.text.SpacyInstance(model='en', disable=None, display_prompt=True)[source]

Bases: object

spaCy pipeline wrapper which prompts the user for model download authorization.

Parameters:
  • model (str, optional) – spacy model name (default: english small model)
  • disable (list of string, optional) – pipeline annotators to disable (default: [])
  • display_prompt (bool, optional) – flag to display/skip license prompt
parser

Return the spaCy parser of this instance

tokenize(text: str) → List[str][source]

Tokenize a sentence into tokens

Parameters:text (str) – text to tokenize

Returns:a list of str tokens of input
Return type:list
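
A minimal sketch (the disabled annotator is illustrative; a model download prompt may appear on first use unless display_prompt is False):

    from nlp_architect.utils.text import SpacyInstance

    nlp = SpacyInstance(model="en", disable=["ner"], display_prompt=False)
    tokens = nlp.tokenize("NLP Architect tokenizes text.")  # list of str
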
class nlp_architect.utils.text.Stopwords[source]

Bases: object

Stop words list class.

static get_words()[source]
stop_words = []
class nlp_architect.utils.text.Vocabulary(start=0, include_oov=True)[source]

Bases: object

A vocabulary that maps words to int ids

add(word)[source]

Add word to vocabulary

Parameters:word (str) – word to add
Returns:id of added word
Return type:int
add_vocab_offset(offset)[source]

Adds an offset to the ints of the vocabulary

Parameters:offset (int) – an int offset
id_to_word(wid)[source]

Word-id to word (string)

Parameters:wid (int) – word id
Returns:string of given word id
Return type:str
max
reverse_vocab()[source]

Return the vocabulary as a reversed dict object

Returns:reversed vocabulary object
Return type:dict
vocab

get the dict object of the vocabulary

Type:dict
word_id(word)[source]

Get the word_id of given word

Parameters:word (str) – word from vocabulary
Returns:int id of word
Return type:int
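
For example, a short sketch of the API above:

    from nlp_architect.utils.text import Vocabulary

    vocab = Vocabulary(start=0, include_oov=True)
    wid = vocab.add("hello")                 # int id of the added word
    vocab.add("world")
    assert vocab.word_id("hello") == wid     # word -> id
    assert vocab.id_to_word(wid) == "hello"  # id -> word
    id_to_word_map = vocab.reverse_vocab()   # reversed dict
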
nlp_architect.utils.text.bio_to_spans(text: List[str], tags: List[str]) → List[Tuple[int, int, str]][source]

Convert a BIO-tagged list of strings into span starts and ends

Parameters:
  • text – list of words
  • tags – list of tags

Returns:list of start, end and tag of detected spans
Return type:tuple
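
For example (a sketch; whether the end index is inclusive or exclusive follows the implementation):

    from nlp_architect.utils.text import bio_to_spans

    words = ["John", "lives", "in", "New", "York"]
    tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
    spans = bio_to_spans(words, tags)
    # a list of (start, end, tag) tuples: one PER span and one LOC span
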
nlp_architect.utils.text.char_to_id(c)[source]

Return the int id of a given character. The OOV char id is len(all_letter) + 1.

Parameters:c (str) – string character
Returns:int value of given char
Return type:int
nlp_architect.utils.text.character_vector_generator(data, start=0)[source]

Character word vector generator util. Transforms a list of sentences into numpy int vectors of the characters of the words of the sentence, and returns the constructed vocabulary

Parameters:
  • data (list) – list of list of strings
  • start (int, optional) – vocabulary index start integer
Returns:

a 2D numpy array of char ids, and the constructed Vocabulary

Return type:

tuple

nlp_architect.utils.text.extract_nps(annotation_list, text=None)[source]

Extract Noun Phrases from given text tokens and phrase annotations. Returns a list of tuples with start/end indexes.

Parameters:
  • annotation_list (list) – a list of annotation tags in str
  • text (list, optional) – a list of token texts in str
Returns:

list of start/end markers of noun phrases; if text is provided, also a list of noun phrase texts

nlp_architect.utils.text.id_to_char(c_id)[source]

Return the character of a given char id

nlp_architect.utils.text.read_sequential_tagging_file(file_path, ignore_line_patterns=None)[source]

Read a tab-separated sequential tagging file. Returns a list of lists of tag tuples (sentences, words).

Parameters:
  • file_path (str) – input file path
  • ignore_line_patterns (list, optional) – list of string patterns to ignore
Returns:

list of list of tuples

nlp_architect.utils.text.simple_normalizer(text)[source]

Simple text normalizer. Runs each token of a phrase through the WordNet lemmatizer and a stemmer.

nlp_architect.utils.text.spacy_normalizer(text, lemma=None)[source]

Simple text normalizer using the spaCy lemmatizer. Runs each token of a phrase through a lemmatizer and a stemmer.

Parameters:
  • text (string) – the text to normalize
  • lemma (string, optional) – lemma of the given text; in this case only the stemmer will run

nlp_architect.utils.text.try_to_load_spacy(model_name)[source]
nlp_architect.utils.text.word_vector_generator(data, lower=False, start=0)[source]

Word vector generator util. Transforms a list of sentences into numpy int vectors and returns the constructed vocabulary

Parameters:
  • data (list) – list of list of strings
  • lower (bool, optional) – transform strings into lower case
  • start (int, optional) – vocabulary index start integer
Returns:

2D numpy array and Vocabulary of the detected words
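
For example (a sketch based on the signatures above; character_vector_generator is used analogously):

    from nlp_architect.utils.text import word_vector_generator

    data = [["Hello", "world"], ["Hello", "again"]]
    vectors, vocab = word_vector_generator(data, lower=True, start=1)
    # vectors: 2D numpy array of word ids; vocab: the constructed Vocabulary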

Module contents